Abstract
**Background** The HIV inference group at Imperial routinely uses TMB to implement complex spatio-temporal statistical models because the scale of the models and the requirement for in-production run-time make MCMC infeasible, and no suitably flexible implementation of INLA exists. However, inferences obtained from TMB are not as accurate as those from MCMC or INLA.
**Task** We compare inferences produced using tmbstan, TMB and aghq for a collection of HIV evidence synthesis models based upon the Naomi small-area estimation model.
**Findings** Treating tmbstan as the gold standard, we find that inference from TMB is substantially worse than that from aghq.
**Next steps** Improve the presentation of this document; understand performance for different parameters; understand the cause of the poor performance.
Jeffrey W. Eaton et al. (2019) specify a joint model linking small-area estimation models of HIV prevalence from household surveys, HIV prevalence from antenatal care clinics, and antiretroviral therapy (ART) coverage from routine health data collection. This model forms the basis of the Naomi small-area estimation model (Jeffrey W. Eaton et al. 2021). Modelling data from multiple sources concurrently increases statistical power and may mitigate the biases of any single source, giving a more complete picture of the situation, as well as prompting investigation into any data conflicts. The model is described by three components, as follows.
Consider a country partitioned into areas \(i = 1, \ldots, n\). A simple random household survey of \(m_i\) people is conducted in each area, and \(y_i\) HIV positive cases are observed. Cases may be modelled using a binomial logistic regression model \[\begin{align} y_i &\sim \text{Bin}(m_i, \rho_i), \\ \text{logit}(\rho_i) &\sim \mathcal{N}(\beta_\rho, \sigma_\rho^2) \end{align}\] where the logit of HIV prevalence \(\rho_i\) is modelled by a Gaussian with mean \(\beta_\rho\) and standard deviation \(\sigma_\rho\).
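As a concrete sketch, this component can be simulated in base R (using, for illustration, the parameter values from the simulation study later in this report):

```r
set.seed(1)

n <- 30           # number of areas
m <- rep(250, n)  # household survey sample size in each area

beta_rho <- -2.4  # mean of logit HIV prevalence
sigma_rho <- 0.5  # standard deviation of logit HIV prevalence

eta <- rnorm(n, mean = beta_rho, sd = sigma_rho)  # logit(rho_i)
rho <- plogis(eta)                                # inverse logit
y <- rbinom(n, size = m, prob = rho)              # observed HIV cases
```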
Routinely collected data from pregnant women attending antenatal care clinics (ANCs) is another important source of information about the HIV epidemic. If, of \(m^\text{ANC}_i\) women, \(y^\text{ANC}_i\) are HIV positive, then an analogous binomial logistic regression model \[\begin{align} y^\text{ANC}_i &\sim \text{Bin}(m^\text{ANC}_i, \rho^\text{ANC}_i), \\ \text{logit}(\rho^\text{ANC}_i) &= \text{logit}(\rho_i) + b_i, \\ b_i &\sim \mathcal{N}(\beta_b, \sigma_b^2), \end{align}\] may be used to describe HIV prevalence amongst the sub-population of women attending ANCs. Reflecting the fact that prevalence in ANCs is related to, but importantly different from, prevalence in the general population, bias terms \(b_i\) are used to offset ANC prevalence from general-population HIV prevalence.
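A corresponding base R sketch for the ANC component (the logit prevalences are redrawn here so the snippet is self-contained; values again follow the simulation study later in this report):

```r
set.seed(1)
n <- 30
eta <- rnorm(n, mean = -2.4, sd = 0.5)  # logit(rho_i), as in the survey model

m_anc <- rep(1e4, n)                  # ANC attendees in each area
b <- rnorm(n, mean = -0.2, sd = 0.1)  # area-level ANC bias terms b_i
rho_anc <- plogis(eta + b)            # ANC prevalence
y_anc <- rbinom(n, size = m_anc, prob = rho_anc)
```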
The number of people receiving treatment at district health facilities \(A_i\) also provides additional information about HIV prevalence. Districts with high prevalence are likely to have a greater number of people receiving treatment, and vice versa. ART coverage, defined to be the proportion of people living with HIV (PLHIV) currently on ART in district \(i\), is given by \(\alpha_i = A_i / (\rho_i N_i)\), where \(N_i\) is the total population of district \(i\) and assumed to be fixed. As such, ART coverage may also be modelled using a binomial logistic regression model \[\begin{align} A_i &\sim \text{Bin}(N_i, \rho_i \alpha_i), \\ \text{logit}(\alpha_i) &\sim \mathcal{N}(\beta_\alpha, \sigma_\alpha^2). \end{align}\]
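And a matching base R sketch for the ART component, where the number on treatment is binomial with success probability \(\rho_i \alpha_i\):

```r
set.seed(1)
n <- 30
N <- rep(1e5, n)                      # fixed population of each district
rho <- plogis(rnorm(n, -2.4, 0.5))    # HIV prevalence rho_i
alpha <- plogis(rnorm(n, 0.7, 0.35))  # ART coverage alpha_i among PLHIV
A <- rbinom(n, size = N, prob = rho * alpha)  # number receiving ART
```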
We consider five models, described below. For each model we write a
TMB C++ template. As well as the standard TMB
inference approach, this template allows the model to be fitted using Stan
via tmbstan and using adaptive Gauss–Hermite quadrature
via aghq.
| Model | Report | Components |
|---|---|---|
| 0 | prev-anc-art_model0 | Prevalence (no random effects) |
| 1 | prev-anc-art_model1 | Prevalence |
| 2 | prev-anc-art_model2 | Prevalence, ANC |
| 3 | prev-anc-art_model3 | Prevalence, ART |
| 4 | prev-anc-art_model4 | Prevalence, ANC, ART |
We perform inference on data simulated from Model 4 (generated in prev-anc-art_sim)
with the following parameter values:
| Parameter | Value |
|---|---|
| \(n\) | 30 |
| \(m_i\) | 250 |
| \(\beta_\rho\) | -2.4 |
| \(\sigma_\rho\) | 0.5 |
| \(m^\text{ANC}_i\) | \(10^4\) |
| \(\beta_b\) | -0.2 |
| \(\sigma_b\) | 0.1 |
| \(N_i\) | \(10^5\) |
| \(\beta_\alpha\) | 0.7 |
| \(\sigma_\alpha\) | 0.35 |
```r
knitr::opts_chunk$set(
  cache = TRUE,
  autodep = TRUE,
  cache.lazy = FALSE,
  cache.comments = FALSE
)

options(scipen = 999)
cbpalette <- multi.utils::cbpalette()

library(tidyverse)

rstan::rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

source("functions.R")
```
Create true values dataframe:
```r
true_values <- data.frame(
  parameter = c(
    "beta_prev",
    "log_sigma_phi_prev",
    "beta_anc",
    "log_sigma_b_anc",
    "beta_art",
    "log_sigma_phi_art"
  ),
  true_value = c(
    -2.4,
    log(sqrt(1 / 4)),
    -0.2,
    log(sqrt(1 / 100)),
    0.7,
    log(0.35)
  )
)
```
Read in results created in reports:
```r
# model0 <- readRDS("depends/results_model0.rds")
model1 <- readRDS("depends/results_model1.rds")
model2 <- readRDS("depends/results_model2.rds")
model3 <- readRDS("depends/results_model3.rds")
model4 <- readRDS("depends/results_model4.rds")
```
Boxplots of the mean and standard deviation for each parameter, where each data point is a different simulation replicate.
```r
params <- c("beta_prev", "log_sigma_phi_prev")

draw_boxplots(model1, params = params) + labs(title = "Model 1")
draw_boxplots(model2, params = params) + labs(title = "Model 2")
draw_boxplots(model3, params = params) + labs(title = "Model 3")
draw_boxplots(model4, params = params) + labs(title = "Model 4")
```
Scatterplots of the same quantities, comparing the inference methods directly.

```r
params <- c("beta_prev", "log_sigma_phi_prev")

draw_scatterplots(model1, params = params) + labs(title = "Model 1")
draw_scatterplots(model2, params = params) + labs(title = "Model 2")
draw_scatterplots(model3, params = params) + labs(title = "Model 3")
draw_scatterplots(model4, params = params) + labs(title = "Model 4")
```
We take samples from all distributions, then compute the maximum ECDF
difference \(D\) (the two-sample
Kolmogorov–Smirnov test statistic). We assume that tmbstan represents
the gold standard, and compare the performance of aghq and
TMB via their similarity to this gold standard. Higher
values of \(D\) correspond to
distributions which are more different.
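As a minimal base R sketch (ks_summary is a hypothetical helper, not the internals of draw_ksplots_D), \(D\) and the location at which it is attained can be computed by evaluating the two ECDFs over the pooled sample:

```r
# Maximum ECDF difference D between two samples, and the location at
# which it is attained. Evaluating at the pooled points is exact,
# since ECDFs only jump at observed values.
ks_summary <- function(x, y) {
  grid <- sort(c(x, y))
  diffs <- abs(ecdf(x)(grid) - ecdf(y)(grid))
  list(D = max(diffs), l = grid[which.max(diffs)])
}

set.seed(1)
x <- rnorm(1000)
y <- rnorm(1000, mean = 0.5)
ks_summary(x, y)$D  # agrees with unname(ks.test(x, y)$statistic)
```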
```r
draw_ksplots_D(model1) + labs(title = "Model 1")
## Warning: Removed 4 rows containing missing values (geom_point).
draw_ksplots_D(model2) + labs(title = "Model 2")
## Warning: Removed 59 rows containing missing values (geom_point).
draw_ksplots_D(model3) + labs(title = "Model 3")
## Warning: Removed 3 rows containing missing values (geom_point).
draw_ksplots_D(model4) + labs(title = "Model 4")
## Warning: Removed 60 rows containing missing values (geom_point).
```
We could also assess \(l\), the location of \(D\). Determining whether there are patterns in the location of the greatest ECDF difference could yield useful insights.
```r
# draw_ksplots_l(model1) + labs(title = "Model 1")
# draw_ksplots_l(model2) + labs(title = "Model 2")
# draw_ksplots_l(model3) + labs(title = "Model 3")
# draw_ksplots_l(model4) + labs(title = "Model 4")
```
When using Markov chain Monte Carlo (MCMC) methods, as we have via
tmbstan, it is important to assess convergence.
One way to do this is via traceplots, which visualise the chains over the iterations run.
```r
map(model1, "mcmc_traceplots")
map(model2, "mcmc_traceplots")
map(model3, "mcmc_traceplots")
map(model4, "mcmc_traceplots")
```
The \(\hat R\) statistic (“R-hat”, the potential scale reduction factor) can also be used. Values of \(\hat R < 1.1\) are typically taken to indicate convergence.
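For reference, a minimal base R sketch of the split-chain \(\hat R\) computation (split_rhat is a hypothetical helper; in practice rstan reports \(\hat R\) directly in its summaries):

```r
# Split-chain potential scale reduction factor (R-hat), assuming a
# matrix of posterior draws with one column per chain
split_rhat <- function(draws) {
  n <- floor(nrow(draws) / 2)
  # Split each chain in half so within-chain trends inflate R-hat
  halves <- cbind(draws[1:n, , drop = FALSE],
                  draws[(n + 1):(2 * n), , drop = FALSE])
  W <- mean(apply(halves, 2, var))  # mean within-chain variance
  B <- n * var(colMeans(halves))    # between-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

set.seed(1)
chains <- matrix(rnorm(4000), ncol = 4)  # four well-mixed chains
split_rhat(chains)  # approximately 1
```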
```r
draw_rhatplot(model1) + labs(title = "Model 1")
draw_rhatplot(model2) + labs(title = "Model 2")
draw_rhatplot(model3) + labs(title = "Model 3")
draw_rhatplot(model4) + labs(title = "Model 4")
```